28 July 2017
Familiar yourself with the iris dataset. Typing iris into R console should load this data. Pay attention to its column, row names and structure of each column.
Please run these codes on your laptop,
## Might be a while...
install.packages(c("ggplot2","dplyr","tidyr","janitor","plotly",
"devtools","learnr","gapminder"))
library(devtools)
install_github("kevinwang09/2017_STAT3914", subdir = "learnr3914")
Statisticians are great at many things:
But the mother of all these, i.e. preparing data is not trivial. (e.g. STAT2xxx lab exams)
readr and readxljanitormagrittrdplyrggplot2base R functions haven't adapted to the needs of modern users.
readr functions are superior in data import warnings, column type handling, speed, scalability and consistency.
library(readr)
dirtyIris = read_csv("dirtyIris.csv")
class(dirtyIris)
[1] "tbl_df" "tbl" "data.frame"
dirtyIris
# A tibble: 250 x 6
SepAl....LeNgth `Sepal.? Width` `petal.Length(*&^` `petal.$#^&Width`
<dbl> <dbl> <dbl> <dbl>
1 5.80000000 2.6 4.0 1.2
2 5.80000000 2.7 5.1 1.9
3 0.01874617 NA NA NA
4 NA NA NA NA
5 NA NA NA NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
# allEmpty <chr>
readxl and haven (for SAS, SPSS etc.) packages work similarly.tibble is a data.frame with better formatting.dirtyIris dataset.Clean data is a data set that allows you to do statistical modelling without extra processing
Cricket_australia vs cricket.Australia)Clean data is a well-designed data.frame.
base R functions might struggle with all issues we mentioned.
character or Date, not a factor ordered alphabetically (default in R).Basic data cleaning using janitor package.
More advanced data manipulation through dplyr.
Our goal: clean the dirtyIris data to be exactly the same as the original iris data.
library(janitor) dirtyIris
# A tibble: 250 x 6
SepAl....LeNgth `Sepal.? Width` `petal.Length(*&^` `petal.$#^&Width`
<dbl> <dbl> <dbl> <dbl>
1 5.80000000 2.6 4.0 1.2
2 5.80000000 2.7 5.1 1.9
3 0.01874617 NA NA NA
4 NA NA NA NA
5 NA NA NA NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
# allEmpty <chr>
## Clean up column names better = clean_names(dirtyIris) better
# A tibble: 250 x 6
sepal_length sepal_width petal_length petal_width species allempty
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 5.80000000 2.6 4.0 1.2 versicolor <NA>
2 5.80000000 2.7 5.1 1.9 virginica <NA>
3 0.01874617 NA NA NA <NA> <NA>
4 NA NA NA NA <NA> <NA>
5 NA NA NA NA <NA> <NA>
# ... with 245 more rows
## Removing empty rows/columns evenBetter = remove_empty_rows(better) evenBetter = remove_empty_cols(evenBetter) evenBetter
# A tibble: 209 x 5
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.80000000 2.6 4.0 1.2 versicolor
2 5.80000000 2.7 5.1 1.9 virginica
3 0.01874617 NA NA NA <NA>
4 6.50000000 3.0 5.5 1.8 virginica
5 6.30000000 2.9 5.6 1.8 virginica
# ... with 204 more rows
Genuinely missing values should be retained, but in this case, the NA's were added.
Removing any rows with NA
evenBetterBetter = remove_missing(evenBetter) cleanIris = evenBetterBetter
glimpse(cleanIris)
Observations: 150 Variables: 5 $ sepal_length <dbl> 5.8, 5.8, 6.5, 6.3, 5.1, 6.2, 5.9, 6.8, 5.0, 6.7,... $ sepal_width <dbl> 2.6, 2.7, 3.0, 2.9, 3.8, 2.8, 3.0, 3.2, 3.0, 3.0,... $ petal_length <dbl> 4.0, 5.1, 5.5, 5.6, 1.9, 4.8, 4.2, 5.9, 1.6, 5.0,... $ petal_width <dbl> 1.2, 1.9, 1.8, 1.8, 0.4, 1.8, 1.5, 2.3, 0.2, 1.7,... $ species <chr> "versicolor", "virginica", "virginica", "virginic...
glimpse(iris)
Observations: 150 Variables: 5 $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,... $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,... $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,... $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,... $ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
mean(cleanIris$sepal_length)
[1] 5.843333
plot(density(cleanIris$sepal_length), col = "red", lwd = 2)
We introduce a new notation: " x %>% f " means "f(x)". We call this operation as "x pipe f".
Compounded operations are possible. Keyboard shortcut is Cmd+shift+M.
cleanIris$sepal_length %>% mean
[1] 5.843333
cleanIris$sepal_length %>% density %>% plot(col = "red", lwd = 2)
sepal_length and sepal_width in the cleanIris data:## Assuming you know the position of column names.
## But what if you resample your data?
cleanIris[1:2, c(1, 2)]
## Assuming you know the position of column names.
## Also assuming the first two columns satisfy certain properties.
cleanIris[1:2, c(T, T, F, F, F)]
## Much better!
## What if you can't type out all the column names
## due to the size of your data?
cleanIris[1:2, c("sepal_length", "sepal_width")]
cleanIris[(cleanIris[,"sepal_length"] < 5) &
(cleanIris[,"sepal_width"] < 3), c("petal_length", "sepal_length")]
# A tibble: 4 x 2
petal_length sepal_length
<dbl> <dbl>
1 1.3 4.5
2 1.4 4.4
3 3.3 4.9
4 4.5 4.9
R user might know about the subset function, but it suffers the same problem of not able to have multiple subsetting criteria without predefined variables.subset(cleanIris,
subset = (sepal_length < 5) & (sepal_width < 3),
select = grep("length", colnames(cleanIris), value = TRUE))
# A tibble: 4 x 2
sepal_length petal_length
<dbl> <dbl>
1 4.5 1.3
2 4.4 1.4
3 4.9 3.3
4 4.9 4.5
select columns are operations on variables.filter rows are operations on observations.
See dplyr cheatsheet.
library(dplyr)
cleanIris %>%
filter(sepal_length < 5, sepal_width < 3) %>%
select(contains("length"))
# A tibble: 4 x 2
sepal_length petal_length
<dbl> <dbl>
1 4.5 1.3
2 4.4 1.4
3 4.9 3.3
4 4.9 4.5
mutate creates new columnsiris_mutated = mutate(cleanIris,
newVar = sepal_length + petal_length,
newNewVar = newVar*2)
iris_mutated
# A tibble: 150 x 7
sepal_length sepal_width petal_length petal_width species newVar
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 5.8 2.6 4.0 1.2 versicolor 9.8
2 5.8 2.7 5.1 1.9 virginica 10.9
3 6.5 3.0 5.5 1.8 virginica 12.0
4 6.3 2.9 5.6 1.8 virginica 11.9
5 5.1 3.8 1.9 0.4 setosa 7.0
# ... with 145 more rows, and 1 more variables: newNewVar <dbl>
group_by + summarise will create summary statistics for grouped variablesbySpecies = cleanIris %>% group_by(species) bySpecies
# A tibble: 150 x 5
# Groups: species [3]
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.8 2.6 4.0 1.2 versicolor
2 5.8 2.7 5.1 1.9 virginica
3 6.5 3.0 5.5 1.8 virginica
4 6.3 2.9 5.6 1.8 virginica
5 5.1 3.8 1.9 0.4 setosa
# ... with 145 more rows
bySpecies %>% summarise(meanSepalLength = mean(sepal_length))
# A tibble: 3 x 2
species meanSepalLength
<chr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
left_join for merging dataflowers = data.frame(species = c("setosa", "versicolor", "virginica"),
comments = c("meh", "kinda_okay", "love_it!"))
## cleanIris has the priority in this join operation
iris_comments = left_join(cleanIris, flowers, by = "species")
## Randomly sampling 6 rows
sample_n(iris_comments, 6)
# A tibble: 6 x 6
sepal_length sepal_width petal_length petal_width species comments
<dbl> <dbl> <dbl> <dbl> <chr> <fctr>
1 6.2 2.8 4.8 1.8 virginica love_it!
2 6.7 3.3 5.7 2.1 virginica love_it!
3 4.6 3.2 1.4 0.2 setosa meh
4 5.4 3.0 4.5 1.5 versicolor kinda_okay
5 7.7 3.8 6.7 2.2 virginica love_it!
6 5.8 2.7 5.1 1.9 virginica love_it!
arrange for ordering rowsarrangeCleanIris = cleanIris %>% arrange(sepal_length, sepal_width, petal_length, petal_width) ## The true iris data arrangeIris = iris %>% clean_names() %>% arrange(sepal_length, sepal_width, petal_length, petal_width)
dirtyIris data and the arranged iris data.## The `Species` column is character or factor all.equal(arrangeCleanIris, arrangeIris)
[1] "Incompatible type for column `species`: x character, y factor"
arrangeIris = arrangeIris %>% mutate(species = as.character(species)) ## Great! all.equal(arrangeCleanIris, arrangeIris)
[1] TRUE
## identical is stronger version of all.equal ## arrangeCleanIris is a tibble ## but arrangeIris is a data.frame identical(arrangeCleanIris, arrangeIris)
[1] FALSE
class(arrangeCleanIris)
[1] "tbl_df" "tbl" "data.frame"
class(arrangeIris)
[1] "data.frame"
## Now, they are exactly the same! identical(arrangeCleanIris, arrangeIris %>% as.tibble)
[1] TRUE
cleanIris %>%
select(contains("length"))
# A tibble: 150 x 2
sepal_length petal_length
<dbl> <dbl>
1 5.8 4.0
2 5.8 5.1
3 6.5 5.5
4 6.3 5.6
5 5.1 1.9
# ... with 145 more rows
bySpecies %>%
summarise_if(is.numeric,
funs(m = mean))
# A tibble: 3 x 5
species sepal_length_m sepal_width_m petal_length_m petal_width_m
<chr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
ggplot2: the best visualisation packageIf you managed to install all packages successfully, you should be able to run the following to get an interactive tutorial sheet.
Otherwise, please download and compile the "ggplot2_basic_tutorial.Rmd" here
library(learnr3914) learnggplot2()
ggplot2 is because its design will force you to use a certain grammar when producing a plot. So much so, every plot you produce is actually a statistic.library(ggplot2)
p1 =
iris %>%
ggplot(aes(x = Sepal.Length,
y = Sepal.Width,
colour = Species)) +
geom_point(size = 3)
p1
data.frame?tidyverse: tidy data + tidy codingtidyverse is a collection of 20+ packages built on the philosophy that both your data and coding should be tidy.
plotly::ggplotly(p1)
Computational tools and organisation.
Use RStudio + RMarkdown to document your codes.
You don't have to adopt everything I taught you today!
Find the component that you like to use and adapt that into your work routine. (Hint: start with cheatsheets and build up gradually.)
Take time to re-analyse an old dataset.
Learn core functions and vignette.
Don't forget the theories and interpretations! This is a course about statistics after all, not Cranking-Out-Numbers-Less-Than-0.05-And-Reject-Null-Hypothesis-101.
sessionInfo()
R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.3 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] bindrcpp_0.2 janitor_0.3.0 plotly_4.7.0 dplyr_0.7.2 [5] purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 [9] ggplot2_2.2.1 tidyverse_1.1.1 knitr_1.16 loaded via a namespace (and not attached): [1] reshape2_1.4.2 haven_1.1.0 lattice_0.20-35 [4] colorspace_1.3-2 htmltools_0.3.6 viridisLite_0.2.0 [7] yaml_2.1.14 rlang_0.1.1 foreign_0.8-69 [10] glue_1.1.1 modelr_0.1.0 readxl_1.0.0 [13] bindr_0.1 plyr_1.8.4 stringr_1.2.0 [16] munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0 [19] rvest_0.3.2 htmlwidgets_0.9 psych_1.7.5 [22] evaluate_0.10.1 labeling_0.3 forcats_0.2.0 [25] httpuv_1.3.5 crosstalk_1.0.0 parallel_3.4.1 [28] broom_0.4.2 Rcpp_0.12.12 xtable_1.8-2 [31] scales_0.4.1.9002 backports_1.1.0 jsonlite_1.5 [34] mime_0.5 mnormt_1.5-5 hms_0.3 [37] digest_0.6.12 stringi_1.1.5 shiny_1.0.3 [40] grid_3.4.1 rprojroot_1.2 tools_3.4.1 [43] magrittr_1.5 lazyeval_0.2.0 pkgconfig_2.0.1 [46] data.table_1.10.4 xml2_1.1.1 lubridate_1.6.0 [49] assertthat_0.2.0 rmarkdown_1.6 httr_1.2.1 [52] R6_2.2.2 nlme_3.1-131 compiler_3.4.1